Employee turnover is costly. An inability to retain talent forces a company to retrain employees frequently, and a general lack of continuity creates a variety of business challenges. This led us to ask: can we predict when an employee is likely to leave? If so, a targeted raise program could be used to keep attrition rates low.
Included below is a summary of the data set prior to processing. It includes how long an employee has spent at their current job and with their current manager, the length of their commute, and a variety of other variables describing their current work environment.
## 'data.frame': 1470 obs. of 35 variables:
## $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : chr "Yes" "No" "Yes" "No" ...
## $ BusinessTravel : chr "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" "Travel_Frequently" ...
## $ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
## $ Department : chr "Sales" "Research & Development" "Research & Development" "Research & Development" ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : int 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : chr "Life Sciences" "Life Sciences" "Other" "Life Sciences" ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : int 1 2 4 5 7 8 10 11 12 13 ...
## $ EnvironmentSatisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : chr "Female" "Male" "Male" "Female" ...
## $ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : int 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : chr "Sales Executive" "Research Scientist" "Laboratory Technician" "Research Scientist" ...
## $ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : chr "Single" "Married" "Single" "Married" ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ Over18 : chr "Y" "Y" "Y" "Y" ...
## $ OverTime : chr "Yes" "No" "Yes" "Yes" ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : int 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: int 1 4 2 3 4 3 1 2 2 2 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : int 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
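The `str()` output above suggests a pre-processing step before modeling: the character columns need to become factors, and several columns (EmployeeCount, StandardHours, Over18) are constant for every row and carry no information. A minimal sketch of that step, illustrated on a small stand-in data frame (the actual pre-processing code is not shown in this report):

```r
# Toy stand-in for the attrition data: two constant columns mirror
# EmployeeCount and StandardHours from the real data set.
df <- data.frame(
  Attrition     = c("Yes", "No", "Yes", "No"),
  EmployeeCount = c(1, 1, 1, 1),       # constant, carries no information
  StandardHours = c(80, 80, 80, 80),   # constant
  MonthlyIncome = c(5993, 5130, 2090, 2909),
  stringsAsFactors = FALSE
)

# Convert character columns to factors for modeling
chr_cols <- sapply(df, is.character)
df[chr_cols] <- lapply(df[chr_cols], factor)

# Drop zero-variance (constant) columns
constant <- sapply(df, function(x) length(unique(x)) == 1)
df <- df[, !constant]

names(df)   # EmployeeCount and StandardHours are gone
```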
After checking our correlogram, we decided to cluster on all 15 numeric variables; however, the clustering was quite unsuccessful. The withinss is displayed below.
## [1] 0.1225601
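A sketch of what that clustering step might look like in base R, assuming k-means on the scaled numeric columns (the choice of `centers = 2` and the three columns here are illustrative assumptions, not the report's actual settings):

```r
# Illustrative k-means on a few scaled numeric columns (toy data).
set.seed(42)
toy <- data.frame(
  MonthlyIncome     = c(5993, 5130, 2090, 2909, 3468, 3068),
  TotalWorkingYears = c(8, 10, 7, 8, 6, 8),
  Age               = c(41, 49, 37, 33, 27, 32)
)
scaled <- scale(toy)                 # standardize so no column dominates
fit <- kmeans(scaled, centers = 2, nstart = 25)

fit$withinss                         # within-cluster sum of squares per cluster
fit$betweenss / fit$totss            # share of variance explained by the clustering
```

Scaling matters here: without it, MonthlyIncome (thousands) would dominate the Euclidean distances over Age and TotalWorkingYears (tens).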
Data partition
This model achieved an accuracy of 0.8597. However, the sensitivity (TPR) was only 0.38 while the specificity was about 0.96, which raised concerns that the model was overlearning the majority class. This led us to balance the dataset.
##
## No Yes
## 533 497
In an effort to mitigate the risk of overlearning the majority class, we trained another rpart2 model on a dataset in which the minority class was oversampled until the classes were balanced. The prevalence of the former minority class rose to about 48% (497 of 1,030 rows, as shown above). Using this model to predict the test set, our sensitivity and specificity both converged around 0.72, but since accuracy dropped to 0.7466, this model was dropped.
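The oversampling step can be done in base R by resampling the minority rows with replacement until the classes match. A minimal sketch on toy data (the actual resampling code and any library used for it are not shown in the report):

```r
# Toy imbalanced training split: 50 "No" vs 10 "Yes".
set.seed(42)
train <- data.frame(
  Attrition     = c(rep("No", 50), rep("Yes", 10)),
  MonthlyIncome = round(runif(60, 2000, 10000))
)

minority <- train[train$Attrition == "Yes", ]
majority <- train[train$Attrition == "No", ]

# Resample minority rows with replacement up to the majority count
extra    <- minority[sample(nrow(minority), nrow(majority), replace = TRUE), ]
balanced <- rbind(majority, extra)

table(balanced$Attrition)   # now 50 / 50
```

Note that resampling must happen only on the training split; oversampling before the train/test partition would leak duplicated rows into the test set.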
used_vars <- c("MonthlyIncome", "TotalWorkingYears", "YearsWithCurrManager", "JobRole", "YearsAtCompany", "Age", "OverTime", "EnvironmentSatisfaction", "JobLevel", "NumCompaniesWorked")
Restricting the model to these variables, accuracy dropped to 0.8371 and ROC stayed at 0.69 (no improvement over the full feature set). The sensitivity and specificity concerns persisted.
We selected maxdepth = 16, but even after adjusting this hyperparameter, accuracy, sensitivity, and specificity stayed roughly the same, and the same sensitivity and specificity concerns persisted.
## rpart2 variable importance
##
## only 20 most important variables shown (out of 30)
##
## Overall
## MonthlyIncome 100.00
## JobRole 98.08
## OverTime 94.92
## TotalWorkingYears 76.04
## YearsWithCurrManager 69.72
## StockOptionLevel 50.87
## NumCompaniesWorked 49.75
## YearsAtCompany 46.71
## EmployeeNumber 42.49
## EnvironmentSatisfaction 41.02
## WorkLifeBalance 39.52
## DailyRate 37.98
## JobLevel 36.52
## Department 30.49
## DistanceFromHome 24.08
## HourlyRate 19.96
## EducationField 19.13
## MonthlyRate 19.05
## RelationshipSatisfaction 18.70
## YearsSinceLastPromotion 15.35
## rpart2 variable importance
##
## only 20 most important variables shown (out of 30)
##
## Overall
## StockOptionLevel 100.000
## MonthlyIncome 74.743
## YearsWithCurrManager 51.580
## TotalWorkingYears 46.193
## JobLevel 41.304
## EnvironmentSatisfaction 39.201
## OverTime 37.398
## DistanceFromHome 34.579
## RelationshipSatisfaction 30.854
## YearsAtCompany 24.743
## JobRole 23.894
## YearsSinceLastPromotion 21.723
## MonthlyRate 20.630
## EmployeeNumber 18.236
## Age 17.928
## TrainingTimesLastYear 16.318
## YearsInCurrentRole 15.690
## DailyRate 15.567
## WorkLifeBalance 12.825
## EducationField 9.875
## rpart2 variable importance
##
## Overall
## MonthlyIncome 100.000
## YearsWithCurrManager 90.451
## OverTime 77.037
## JobRole 73.494
## TotalWorkingYears 69.563
## YearsAtCompany 47.905
## EnvironmentSatisfaction 26.555
## JobLevel 19.969
## NumCompaniesWorked 4.954
## Age 0.000
Our original Random Forest performed so poorly that we did not even keep the code: it predicted that only 4 people would ever quit. We changed every parameter we could think of, but the simple truth was that we did not have enough data, our data was too imbalanced, and we had too many features. After rebalancing the data and shrinking the feature space while testing different rpart2 models, we decided to see how those changes would affect a Random Forest.
Tuning mtry gave a value of about 6. We started with a sample size of 100 and 1,000 trees. We saw increased success with this model, at least compared to the first RF, but wanted to continue to tinker with the parameters.
When we increased the sample size from 100 to 200, OOB error dropped as expected. Additionally, we dropped the number of trees from 1,000 to 500. The high number of features relative to the limited number of rows in this dataset meant that our Random Forest was facing an overlearning problem, and we expected that restricting the number of features would help mitigate it.
## Confusion Matrix and Statistics
##
## Actual
## Prediction No Yes
## No 150 13
## Yes 35 23
##
## Accuracy : 0.7828
## 95% CI : (0.7226, 0.8353)
## No Information Rate : 0.8371
## P-Value [Acc > NIR] : 0.986238
##
## Kappa : 0.3609
##
## Mcnemar's Test P-Value : 0.002437
##
## Sensitivity : 0.6389
## Specificity : 0.8108
## Pos Pred Value : 0.3966
## Neg Pred Value : 0.9202
## Precision : 0.3966
## Recall : 0.6389
## F1 : 0.4894
## Prevalence : 0.1629
## Detection Rate : 0.1041
## Detection Prevalence : 0.2624
## Balanced Accuracy : 0.7248
##
## 'Positive' Class : Yes
##
This RF model produced an accuracy of 0.78, a sensitivity of 0.64, and a specificity of 0.81. Although using the balanced dataset improved the overlearning problem, accuracy fell below the no-information rate of 0.8371, i.e. worse than always predicting the majority class.
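The rates in the confusion matrix above can be recomputed by hand, which is a useful sanity check on caret's output (positive class = Yes, so TP = 23, FN = 13, FP = 35, TN = 150):

```r
# Recompute the headline metrics from the printed confusion matrix.
tp <- 23; fn <- 13; fp <- 35; tn <- 150   # positive class = "Yes"

sensitivity <- tp / (tp + fn)             # recall / true positive rate
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)             # positive predictive value
f1          <- 2 * precision * sensitivity / (precision + sensitivity)
accuracy    <- (tp + tn) / (tp + tn + fp + fn)

round(c(sensitivity, specificity, precision, f1, accuracy), 4)
# 0.6389 0.8108 0.3966 0.4894 0.7828
```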
After only marginal success with tree-based models (we were never able to push accuracy significantly above the no-information rate), we moved to a KNN approach, expecting that it might handle our imbalanced dataset marginally better.
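To make the KNN idea concrete, here is a minimal k-nearest-neighbours classifier in base R. This is only a sketch of the algorithm; the actual model was presumably fit with a package such as class or caret, and the toy data below is invented for illustration:

```r
# Minimal KNN: classify each new row by majority vote among the k
# training rows closest in Euclidean distance.
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  apply(new_x, 1, function(row) {
    d <- sqrt(rowSums((t(t(train_x) - row))^2))  # distances to all training rows
    votes <- train_y[order(d)[1:k]]              # labels of the k nearest rows
    names(which.max(table(votes)))               # majority vote
  })
}

# Toy example: two well-separated groups
train_x <- rbind(matrix(c(1, 1, 2, 1, 1, 2), ncol = 2, byrow = TRUE),
                 matrix(c(8, 8, 9, 8, 8, 9), ncol = 2, byrow = TRUE))
train_y <- c("No", "No", "No", "Yes", "Yes", "Yes")
knn_predict(train_x, train_y, rbind(c(1.5, 1.5), c(8.5, 8.5)), k = 3)
# "No" "Yes"
```

As with k-means, the features should be scaled before computing distances on the real data, since KNN is distance-based.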
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 192 14
## 1 2 13
##
## Accuracy : 0.9276
## 95% CI : (0.8851, 0.9581)
## No Information Rate : 0.8778
## P-Value [Acc > NIR] : 0.01146
##
## Kappa : 0.5826
##
## Mcnemar's Test P-Value : 0.00596
##
## Sensitivity : 0.48148
## Specificity : 0.98969
## Pos Pred Value : 0.86667
## Neg Pred Value : 0.93204
## Prevalence : 0.12217
## Detection Rate : 0.05882
## Detection Prevalence : 0.06787
## Balanced Accuracy : 0.73559
##
## 'Positive' Class : 1
##
After finalizing our KNN model, we evaluated it on the following additional metrics:
Shown above is our LogLoss followed by the baseline. Our LogLoss of 0.89 is significantly lower than the baseline LogLoss of 1.8. This is encouraging because it means our model is not making highly confident predictions in the wrong direction. Given how imbalanced the dataset is, we were quite satisfied with this metric.
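For reference, log loss can be computed in a few lines of base R. This is a generic sketch of the metric; the 0.89 and 1.8 figures above came from the report's own model and baseline, which are not reproduced here:

```r
# Binary log loss: actual is 0/1, prob is the predicted P(class = 1).
log_loss <- function(actual, prob, eps = 1e-15) {
  prob <- pmin(pmax(prob, eps), 1 - eps)   # clip to avoid log(0)
  -mean(actual * log(prob) + (1 - actual) * log(1 - prob))
}

# A confident wrong prediction is punished far more than an unsure one:
log_loss(actual = 1, prob = 0.95)   # ~0.05
log_loss(actual = 1, prob = 0.05)   # ~3.0
```

This asymmetry is exactly why the metric supports the claim above: a low log loss means the model rarely assigns high confidence to the wrong class.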
Our F1 score of 0.62 is much higher than the baseline F1 of 0.2776 that we calculated. This is additionally encouraging given the imbalance of our dataset.
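The F1 of roughly 0.62 can be verified directly from the KNN confusion matrix above (positive class = 1, so TP = 13, FP = 2, FN = 14):

```r
# F1 from the KNN confusion matrix printed earlier.
tp <- 13; fp <- 2; fn <- 14
precision <- tp / (tp + fp)                  # 13/15 = 0.8667 (Pos Pred Value)
recall    <- tp / (tp + fn)                  # 13/27 = 0.4815 (Sensitivity)
f1 <- 2 * precision * recall / (precision + recall)
round(f1, 2)
# 0.62
```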
In the end, we built a model that is quite successful at predicting attrition. Given the business value of predicting attrition, we wanted our model to catch as many employees likely to leave as possible, even if this led to some false positives; retraining new employees is far more expensive than targeted salary raises that convince current employees to stay. Our data set was challenging: it contained relatively few rows, and the data it did contain was quite imbalanced. Because of this imbalance, we struggled to beat the no-information rate with decision trees, and ensemble methods were messy and performed worse than our individual trees even after balancing the data set and reducing the feature space. Ultimately we had the most success with a KNN model at a relatively low k. This model is advantageous because it is uncomplicated and predicts significantly better than the no-information rate and any other model we built. It should save the company money: they will be able to predict and incentivize employees likely to leave, avoiding high turnover and retraining costs.
Future work that would most benefit this model is gathering more data. Our small dataset made many advanced methods challenging to use, and with such an imbalanced dataset, more data would be especially valuable. Additionally, with more company-specific data, the model could be tuned for profitability (weighing the cost of raises against the cost of letting employees leave), and models that failed to meet certain metrics could be further tailored to a company's needs.